
    Fast and efficient statistical methods for detecting genetic admixture events and its applications in large-scale data cohorts

    Present-day cohorts of genome-wide DNA provide a powerful means of elucidating admixture events where different human groups intermixed, providing new insights into human history and population movements. The method GLOBETROTTER (Hellenthal et al., 2014) shows increased precision over other available techniques for characterising admixture due to modelling haplotype information, i.e. associations among tightly linked Single Nucleotide Polymorphisms (SNPs). However, because of its computational demands, GLOBETROTTER can only handle relatively small sample sizes of tens to hundreds of admixed individuals. In this thesis, I present a new statistical method, fastGLOBETROTTER, that both reduces computational time and increases accuracy relative to GLOBETROTTER. In particular, fastGLOBETROTTER more efficiently models admixture linkage disequilibrium by sampling sets of genomic regions within individuals that are the most informative for admixture events. Additionally, I have developed an algorithm for allocating memory more efficiently, enabling up to a 20-fold improvement in computation time relative to GLOBETROTTER. Therefore, this technique can cope with the rapidly emerging large-scale cohorts of genetically homogeneous populations sampled from small geographic regions, e.g. within a country (China Kadoorie Biobank, UK Biobank), to provide more precise estimates of admixture dates. Via simulations, I use fastGLOBETROTTER to demonstrate the sample sizes required to characterize admixture between groups with high levels of genetic similarity, and the time depths for which these approaches can reliably detect such past intermixing. I also apply fastGLOBETROTTER to over 6000 European individuals, using over 2500 individuals as ancestry surrogates, revealing new insights into admixture across Western Europe. 
These include admixture events dated to ∼500-600 CE from sources carrying DNA related to present-day West Asian and North African populations found in individuals within France, Belgium and parts of Germany. I also report admixture from East-Asian/Siberian-like sources in individuals within Finland, Norway and Sweden at different times starting ∼1900 years ago.

    An efficient method to identify, date, and describe admixture events using haplotype information

    We present fastGLOBETROTTER, an efficient new haplotype-based technique to identify, date, and describe admixture events using genome-wide autosomal data. With simulations, we demonstrate how fastGLOBETROTTER reduces computation time by an order of magnitude relative to the related technique GLOBETROTTER without suffering loss of accuracy. We apply fastGLOBETROTTER to a cohort of >6000 Europeans from ten countries, revealing previously unreported admixture signals. In particular we infer multiple periods of admixture related to East Asian or Siberian-like sources, starting >2000 years ago, in people living in countries north of the Baltic Sea. In contrast, we infer admixture related to West Asian, North African and/or Southern European sources in populations south of the Baltic Sea, including admixture dated to ≈300-700CE, overlapping the fall of the Roman Empire, in people from Belgium, France and parts of Germany. Our new approach scales to analyzing hundreds to thousands of individuals from a putatively admixed population and hence is applicable to emerging large-scale cohorts of genetically homogeneous populations

    WASP: a Web-based Allele-Specific PCR assay designing tool for detecting SNPs and mutations

    BACKGROUND: Allele-specific (AS) Polymerase Chain Reaction is a convenient and inexpensive method for genotyping Single Nucleotide Polymorphisms (SNPs) and mutations. It is applied in many recent studies including population genetics, molecular genetics and pharmacogenomics. Designing primers with existing AS primer design tools is cumbersome for inexperienced users, since information about the SNP or mutation must first be acquired from public databases. Furthermore, most of these tools do not offer mismatch enhancement for the designed primers. The available web applications provide neither a user-friendly graphical input interface nor an intuitive visualization of their primer results. RESULTS: This work presents a web-based AS primer design application called WASP. This tool can efficiently design AS primers for human SNPs as well as mutations. To assist scientists with collecting necessary information about target polymorphisms, this tool provides a local SNP database containing over 10 million SNPs of various populations from public domain databases, namely NCBI dbSNP, HapMap and JSNP. This database is tightly integrated with the tool so that users can perform the design for existing SNPs without leaving the site. To guarantee the specificity of AS primers, the proposed system incorporates a primer specificity enhancement technique widely used in experimental protocols. In particular, WASP makes use of different destabilizing effects by introducing one deliberate 'mismatch' at the penultimate (second to last from the 3'-end) base of AS primers to improve the resulting AS primers. Furthermore, WASP offers a graphical user interface, drawn with Scalable Vector Graphics (SVG), that allows users to select SNPs and graphically visualize designed primers and their conditions. CONCLUSION: WASP offers a tool for designing AS primers for both SNPs and mutations. 
By integrating the database for known SNPs (using gene ID or rs number), this tool simplifies the awkward process of getting flanking sequences and other related information from public SNP databases. It takes into account the underlying destabilizing effect to ensure the effectiveness of designed primers. With its user-friendly SVG interface, WASP intuitively presents the designed primers, helping users to export them or make further adjustments to the design. This software can be freely accessed at http://bioinfo.biotec.or.th/WASP
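
The penultimate-base mismatch described in this abstract can be illustrated with a minimal Python sketch. This is not WASP's code: the function name `design_as_primers`, the default primer length, and above all the `MISMATCH` substitution table are hypothetical placeholders. The abstract says WASP chooses the mismatch according to destabilizing effects, and that choice is not reproduced here.

```python
# Hypothetical placeholder table: maps the true penultimate base to a
# different base. WASP selects mismatches by destabilizing strength;
# this fixed mapping is only an assumption for illustration.
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}


def design_as_primers(upstream, alleles, length=20):
    """Build one allele-specific primer per allele.

    upstream: sequence immediately 5' of the SNP on the primer strand.
    alleles:  the SNP alleles; each becomes the 3'-terminal base of a primer.
    length:   total primer length, including the allele base.
    """
    upstream = upstream.upper()
    if len(upstream) < length - 1:
        raise ValueError("upstream flank is shorter than the primer body")
    body = upstream[-(length - 1):]
    # The last base of the flank is the primer's penultimate base;
    # replace it with a deliberate mismatch.
    body = body[:-1] + MISMATCH[body[-1]]
    return {allele: body + allele for allele in alleles}


primers = design_as_primers("GATTACAGG", ("A", "G"), length=6)
print(primers)
```

Both primers share the same body and differ only at the 3'-terminal base, so each should extend efficiently only on its own allele, while the shared mismatch one base upstream further destabilizes binding to the wrong allele.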

    Iterative pruning PCA improves resolution of highly structured populations

    BACKGROUND: Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. Therefore, populations are structured into genetically distinct subpopulations. As genotypic datasets become ever larger, it is increasingly difficult to correctly estimate the number of subpopulations and assign individuals to them. The computationally efficient non-parametric, chiefly Principal Components Analysis (PCA)-based methods are thus becoming increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, the accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework which addresses this shortcoming. RESULTS: A novel population structure analysis algorithm called iterative pruning PCA (ipPCA) was developed which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structures, the subpopulation assignments of individuals made by ipPCA were largely consistent with the STRUCTURE, BAPS and AWclust algorithms. On the other hand, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by other methods. CONCLUSION: The algorithm is computationally efficient and not constrained by the dataset complexity. 
This systematic subpopulation assignment approach removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogenous population
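
The iterate, split and recurse idea behind ipPCA can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the published algorithm: the names `top_pc` and `ippca` are hypothetical, the stopping rule (fraction of variance on the first principal component, `min_fraction`) is a stand-in for the statistical test used by the authors, and splitting by the sign of the first PC score replaces their clustering step.

```python
from math import sqrt


def top_pc(rows, iters=200):
    """Return (scores, explained_fraction) for the first principal component,
    found by power iteration on the column-centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]
    total = sum(x * x for row in X for x in row)
    if total < 1e-12:  # no variation left in this group
        return [0.0] * n, 0.0
    v = [1.0 / (j + 1) for j in range(m)]  # deterministic start vector
    for _ in range(iters):
        s = [sum(a * b for a, b in zip(row, v)) for row in X]          # X v
        w = [sum(X[i][j] * s[i] for i in range(n)) for j in range(m)]  # X^T X v
        norm = sqrt(sum(c * c for c in w))
        if norm == 0:
            return [0.0] * n, 0.0
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in X]
    return scores, sum(s * s for s in scores) / total


def ippca(genotypes, min_size=2, min_fraction=0.5):
    """Recursively bisect individuals along the first PC.

    A group is split only if PC1 carries at least `min_fraction` of its
    variance (assumed toy stopping rule) and both halves keep `min_size`
    individuals. Returns a list of clusters, each a list of row indices.
    """
    clusters, stack = [], [list(range(len(genotypes)))]
    while stack:
        idx = stack.pop()
        if len(idx) >= 2 * min_size:
            scores, frac = top_pc([genotypes[i] for i in idx])
            left = [i for i, s in zip(idx, scores) if s <= 0]
            right = [i for i, s in zip(idx, scores) if s > 0]
            if (frac >= min_fraction and len(left) >= min_size
                    and len(right) >= min_size):
                stack.extend([left, right])  # prune into two groups, iterate
                continue
        clusters.append(idx)  # no further structure: terminal subpopulation
    return clusters


A, B, C = [0, 0, 0, 0, 2, 2], [2, 2, 0, 0, 0, 0], [2, 2, 2, 2, 0, 0]
print(ippca([A] * 3 + [B] * 3 + [C] * 3))
```

In this toy input the two closely related groups (B and C) stay together after the first split and are separated only when PCA is rerun on their subset, which is the situation the abstract describes for overlapping subpopulations.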

    Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure

    BACKGROUND: The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. RESULTS: A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. CONCLUSIONS: The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. 
The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.

    Genetic analysis of Thai cattle reveals a Southeast Asian indicine ancestry

    Cattle commonly raised in Thailand have characteristics of Bos indicus (zebu). We do not know when or how cattle domestication in Thailand occurred, and so questions remain regarding their origins and relationships to other breeds. We obtained genome-wide SNP genotypic data of 28 bovine individuals sampled from four regions: North (Kho-Khaolampoon), Northeast (Kho-Isaan), Central (Kho-Lan) and South (Kho-Chon) Thailand. These regional varieties have distinctive traits suggestive of breed-like genetic variations. From these data, we confirmed that all four Thai varieties are Bos indicus and that they are distinct from other indicine breeds. Among these Thai cattle, a distinctive ancestry pattern is apparent, which is the purest within Kho-Chon individuals. This ancestral component is only present outside of Thailand among other indicine breeds in Southeast Asia. From this pattern, we conclude that a unique Bos indicus ancestor originated in Southeast Asia, and native Kho-Chon Thai cattle retain the signal of this ancestry with limited admixture of other bovine ancestors.

    Encoded haplotype data as input to ipPCA can better resolve population clustering

    BACKGROUND: Studies in population genetics are mainly based on the analysis of genetic variations among different populations. With the advent of advanced genotyping technology, large numbers of Single Nucleotide Polymorphisms (SNPs) can be used to capture the underlying population variations. Iterative pruning principal component analysis (ipPCA) is a very powerful tool to cluster subpopulations based on their SNP profiles. However, when several similar populations are considered in the analysis, differentiating these populations can become very challenging. Haplotypes are known to capture more segregation information and offer higher power than SNPs, but owing to the high complexity of haplotype inference, this concept has not been widely used. Recently, haplotype sharing (HS) was reported as a good alternative method to evaluate variation among populations. HS interrogates the entire genotyping data without estimating haplotype blocks, making it computationally efficient while retaining the population profile. Adopting the HS technique and introducing a new haplotype encoding as the input to ipPCA to perform population clustering can yield very good outcomes. RESULTS: In this study we transformed indigenous Thai SNP genotyping data, obtained from the Pan Asian SNP consortium, into encoded haplotype profiles. The dataset includes 13 indigenous populations (245 individuals), with approximately 54K SNPs for each individual. To do this, an encoded haplotype matrix was constructed by inferring overlapping haplotypes with a sliding window approach in BEAGLE, an efficient haplotype inference tool. We fed this encoded haplotype matrix to ipPCA to cluster these individuals into sub-groups using only their genetic profiles. We compared the results obtained from the standard ipPCA protocol with those obtained using the encoded haplotype matrix, in terms of the number of clustered subpopulations as well as the accuracy of assigning each individual to the correct subpopulation. 
Using the encoded haplotype matrix as input to ipPCA recovered exactly 13 subpopulations with 99.18% individual assignment accuracy, whereas the conventional ipPCA identified only 10 subpopulations with 93.47% individual assignment accuracy. CONCLUSIONS: Our results demonstrate the great potential of using the encoded haplotype matrix with ipPCA for population genetics studies. This new protocol can promote the clustering of individuals using only their genetic profiles.
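
The abstract does not spell out the haplotype encoding, so the following Python sketch shows only one plausible way to turn already phased haplotypes into a numeric matrix over overlapping sliding windows. Everything here is an assumption for illustration: the function name `encode_haplotypes`, the `window` and `step` parameters, and the choice to count, per individual, how many of its two haplotypes carry each distinct window pattern. Phasing itself (done with BEAGLE in the study) is taken as given.

```python
def encode_haplotypes(phased, window=3, step=1):
    """Encode phased haplotypes as per-window pattern counts.

    phased: list of individuals, each a pair of equal-length haplotype
            strings (e.g. "0110"), assumed to be phased already.
    Returns (columns, matrix): columns is a list of (start, pattern) labels,
    matrix[i][k] is how many of individual i's two haplotypes (0, 1 or 2)
    show pattern columns[k] at that window.
    """
    n_snps = len(phased[0][0])
    columns = []
    matrix = [[] for _ in phased]
    for start in range(0, n_snps - window + 1, step):
        end = start + window
        # Distinct haplotype patterns observed in this window, in sorted order.
        patterns = sorted({hap[start:end] for pair in phased for hap in pair})
        for pattern in patterns:
            columns.append((start, pattern))
            for row, pair in zip(matrix, phased):
                row.append(sum(hap[start:end] == pattern for hap in pair))
    return columns, matrix


phased = [("0011", "0011"), ("0011", "1100"), ("1100", "1100")]
columns, matrix = encode_haplotypes(phased, window=2)
for row in matrix:
    print(row)
```

A matrix of this shape can be passed to a PCA-based method in place of per-SNP genotype counts; whether this matches the encoding used in the study cannot be determined from the abstract alone.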